335 research outputs found

    Modeling SAGE tag formation and its effects on data interpretation within a Bayesian framework

    Background: Serial Analysis of Gene Expression (SAGE) is a high-throughput method for inferring mRNA expression levels from experimentally generated sequence-based tags. Standard analyses of SAGE data, however, ignore the fact that the probability of generating an observable tag varies across genes and between experiments. As a consequence, these analyses result in biased estimators and posterior probability intervals for gene expression levels in the transcriptome. Results: Using the yeast Saccharomyces cerevisiae as an example, we introduce a new Bayesian method of data analysis which is based on a model of SAGE tag formation. Our approach incorporates the variation in the probability of tag formation into the interpretation of SAGE data and allows us to derive exact joint and approximate marginal posterior distributions for the mRNA frequency of genes detectable using SAGE. Our analysis of these distributions indicates that the frequency of a gene in the tag pool is influenced by its mRNA frequency, the cleavage efficiency of the anchoring enzyme (AE), and the number of informative and uninformative AE cleavage sites within its mRNA. Conclusion: With a mechanistic, model-based approach for SAGE data analysis, we find that inter-genic variation in SAGE tag formation is large. However, this variation can be estimated and, importantly, accounted for using the methods we develop here. As a result, SAGE-based estimates of mRNA frequencies can be adjusted to remove the bias introduced by the SAGE tag formation process.
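
The dependence of tag formation on cleavage efficiency and site count can be illustrated with a small sketch. It is not the paper's model: it assumes, purely for illustration, that the anchoring enzyme cuts each site independently with the same probability and that the tag comes from the 3'-most site that was cut. The function names and the `informative` flags are hypothetical.

```python
from typing import List


def tag_site_probabilities(n_sites: int, cleavage_prob: float) -> List[float]:
    """Probability that the tag originates from each AE site.

    Sites are indexed from the 3' end (index 0 is the 3'-most site).
    Assumption: each site is cut independently with probability
    `cleavage_prob`, and the tag comes from the 3'-most cut site, so
    site i yields the tag with probability p * (1 - p) ** i.
    """
    if not 0.0 <= cleavage_prob <= 1.0:
        raise ValueError("cleavage_prob must lie in [0, 1]")
    return [cleavage_prob * (1.0 - cleavage_prob) ** i for i in range(n_sites)]


def prob_observable_tag(informative: List[bool], cleavage_prob: float) -> float:
    """Probability that a transcript yields a tag from an informative site.

    `informative[i]` marks whether the tag at site i (counted from the
    3' end) can be assigned to the gene; uninformative sites still
    absorb probability mass but produce no usable tag.
    """
    probs = tag_site_probabilities(len(informative), cleavage_prob)
    return sum(p for p, ok in zip(probs, informative) if ok)


if __name__ == "__main__":
    # Hypothetical transcript with three AE sites, the middle one uninformative.
    print(tag_site_probabilities(3, 0.5))
    print(prob_observable_tag([True, False, True], 0.5))
```

Under these assumptions, two genes with the same mRNA frequency but different site layouts contribute different numbers of usable tags, which is the kind of bias the abstract describes.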

    Bias correction and Bayesian analysis of aggregate counts in SAGE libraries

    Background: Tag-based techniques, such as SAGE, are commonly used to sample the mRNA pool of an organism's transcriptome. Incomplete digestion during the tag formation process may allow for multiple tags to be generated from a given mRNA transcript. The probability of forming a tag varies with its relative location. As a result, the observed tag counts represent a biased sample of the actual transcript pool. In SAGE this bias can be avoided by ignoring all but the 3'-most tag, but doing so discards a large fraction of the observed data. Taking this bias into account should allow more of the available data to be used, leading to increased statistical power. Results: Three new hierarchical models, which directly embed a model for the variation in tag formation probability, are proposed and their associated Bayesian inference algorithms are developed. These models may be applied to libraries at both the tag and aggregate level. Simulation experiments and analysis of real data are used to contrast the accuracy of the various methods. The consequences of tag formation bias are discussed in the context of testing differential expression. A description is given as to how these algorithms can be applied in that context. Conclusions: Several Bayesian inference algorithms that account for tag formation effects are compared with the DPB algorithm, providing clear evidence of superior performance. The accuracy of inferences when using a particular non-informative prior is found to depend on the expression level of a given gene. The multivariate nature of the approach easily allows both univariate and joint tests of differential expression. Calculations demonstrate the potential for false positive and negative findings due to variation in tag formation probabilities across samples when testing for differential expression.

    Genomic approaches to research in lung cancer

    The medical research community is experiencing a marked increase in the amount of information available on genomic sequences and genes expressed by humans and other organisms. This information offers great opportunities for improving our understanding of complex diseases such as lung cancer. In particular, we should expect to witness a rapid increase in the rate of discovery of genes involved in lung cancer pathogenesis and we should be able to develop reliable molecular criteria for classifying lung cancers and predicting biological properties of individual tumors. Achieving these goals will require collaboration by scientists with specialized expertise in medicine, molecular biology, and decision-based statistical analysis

    Analysis of multiplex gene expression maps obtained by voxelation

    Background: Gene expression signatures in the mammalian brain hold the key to understanding neural development and neurological disease. Researchers have previously used voxelation in combination with microarrays for acquisition of genome-wide atlases of expression patterns in the mouse brain. On the other hand, some work has studied gene functions without taking into account where in the mouse brain a gene is expressed. In this paper, we present an approach for identifying the relation between gene expression maps obtained by voxelation and gene functions. Results: To analyze the dataset, we chose typical genes as queries and aimed at discovering similar gene groups. Gene similarity was determined by using wavelet features extracted from gene expression maps averaged over the left and right hemispheres, and by the Euclidean distance between each pair of feature vectors. We also performed a multiple clustering approach on the gene expression maps, combined with hierarchical clustering. Within each group of similar genes and each cluster, gene function similarity was measured by calculating the average gene function distance in the gene ontology structure. By applying our methodology to find genes similar to certain target genes, we were able to improve our understanding of gene expression patterns and gene functions. By applying the clustering analysis method, we obtained significant clusters that have both very similar gene expression maps and very similar gene functions with respect to their corresponding gene ontologies. The cellular component ontology resulted in prominent clusters expressed in cortex and corpus callosum. The molecular function ontology gave prominent clusters in cortex, corpus callosum and hypothalamus. The biological process ontology resulted in clusters in cortex, hypothalamus and choroid plexus. Clusters from all three ontologies combined were most prominently expressed in cortex and corpus callosum. Conclusion: The experimental results confirm the hypothesis that genes with similar gene expression maps might have similar gene functions. The voxelation data take into account the location information of gene expression level in mouse brain, which is novel in related research. The proposed approach can potentially be used to predict gene functions and provide helpful suggestions to biologists
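
The query step described above, ranking genes by Euclidean distance between feature vectors, can be sketched as follows. This is only an illustration: the wavelet feature extraction is assumed to have been done already, and the gene names, feature values, and the function `rank_by_similarity` are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def rank_by_similarity(query: str, features: Dict[str, Sequence[float]]) -> List[str]:
    """Return all genes other than `query`, ordered from most to least similar.

    Similarity is the Euclidean distance between feature vectors; a
    smaller distance means a more similar expression map. The vectors
    are assumed to be precomputed (for example, wavelet coefficients).
    """
    query_vector = features[query]
    distances = [
        (math.dist(query_vector, vector), gene)
        for gene, vector in features.items()
        if gene != query
    ]
    # Sort by distance, breaking ties by gene name for a stable order.
    return [gene for _, gene in sorted(distances)]


if __name__ == "__main__":
    # Hypothetical three-dimensional feature vectors for three genes.
    example = {
        "geneA": [1.0, 0.0, 0.0],
        "geneB": [1.0, 0.1, 0.0],
        "geneC": [0.0, 5.0, 5.0],
    }
    print(rank_by_similarity("geneA", example))
```

`math.dist` requires Python 3.8 or later; on older versions the distance can be computed with an explicit sum of squared differences.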

    A comparative analysis of the information content in long and short SAGE libraries

    BACKGROUND: Serial Analysis of Gene Expression (SAGE) is a powerful tool to determine gene expression profiles. Two types of SAGE libraries, ShortSAGE and LongSAGE, are classified based on the length of the SAGE tag (10 vs. 17 basepairs). LongSAGE libraries are thought to be more useful than ShortSAGE libraries, but their information content has not been widely compared. To dissect the differences between these two types of libraries, we utilized four libraries (two LongSAGE and two ShortSAGE libraries) generated from the hippocampus of Alzheimer and control samples. In addition, we generated two additional short SAGE libraries, the truncated long SAGE libraries (tSAGE), from LongSAGE libraries by deleting seven 5' basepairs from each LongSAGE tag. RESULTS: One problem in SAGE studies is that, because tags are short, an individual tag may match multiple different genes. We found that the LongSAGE tag maps up to 15 UniGene clusters, while the ShortSAGE and tSAGE tags map up to 279 UniGene clusters. Both long and short SAGE libraries exhibit a large number of orphan tags (no gene information in UniGene), implying the limitation of the UniGene database. Among 100 orphan LongSAGE tags, the complete sequences (17 basepairs) of nine orphan tags match to 17 genomic sequences; four of the orphan tags match to a single genomic sequence. Our data show the potential to resolve 4–9% of orphan LongSAGE tags. Finally, among 400 tSAGE tags showing significant differential expression between AD and control, 79 tags (19.8%) were derived from multiple non-significant LongSAGE tags, implying false positive results. CONCLUSION: Our data show that LongSAGE tags have high specificity in gene mapping compared to ShortSAGE tags. LongSAGE tags show an advantage over ShortSAGE in identifying novel genes by BLAST analysis. Most importantly, the chances of obtaining false positive results are higher for ShortSAGE than for LongSAGE libraries because of the lower specificity of short tags in gene mapping. Therefore, we recommend considering the number of UniGene clusters (genes or ESTs) corresponding to a tag when prioritizing significant results
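
How truncation can merge several LongSAGE tags into one short tag is easy to see in a small sketch. It follows the abstract's description of tSAGE (dropping seven basepairs from the 5' end of each 17-basepair tag) and assumes tags are written as strings in 5' to 3' order. The tag sequences, counts, and the function `truncate_long_tags` are made up for illustration and are not the authors' code.

```python
from collections import Counter
from typing import Mapping


def truncate_long_tags(long_counts: Mapping[str, int], n_removed: int = 7) -> Counter:
    """Collapse LongSAGE tag counts into truncated (tSAGE-style) tag counts.

    Each tag loses its first `n_removed` characters, and the counts of
    long tags that share the same remaining sequence are summed.
    """
    short_counts: Counter = Counter()
    for tag, count in long_counts.items():
        short_counts[tag[n_removed:]] += count
    return short_counts


if __name__ == "__main__":
    # Hypothetical 17-basepair tags; the first two share their last ten bases.
    long_counts = {
        "AAAAAAACCCCCCCCCC": 3,
        "GGGGGGGCCCCCCCCCC": 4,
        "AAAAAAATTTTTTTTTT": 1,
    }
    print(truncate_long_tags(long_counts))
```

In this toy input, two long tags with modest counts merge into a single short tag with a larger count, which is how a truncated tag can look differentially expressed even when none of its source tags does.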

    Analysis of the functional repertoire of a mutant form of survivin, K129E, which has been linked to lung cancer

    Background: Survivin is a protein that is normally present only in G2 and M-phases in somatic cells; in cancer cells, however, it is expressed throughout the cell cycle. A prosurvival factor, survivin is both an inhibitor of apoptosis and an essential mitotic protein, and thus it has attracted much attention as a target for new oncotherapies. Despite its prevalence in cancer, reports of survivin mutations have mostly been restricted to loci within its promoter, which increase the abundance of the protein. To date the only published mutation within the coding sequence is an adenine > guanine substitution in exon 4. This polymorphism, which was found in a cohort of Korean lung cancer patients, causes a lysine > glutamic acid mutation (K129E) in the protein. However, whether it plays a causative role in cancer has not been addressed. Methods: Using site-directed mutagenesis we recapitulate K129E expression in cultured human cells and assess its anti-apoptotic and mitotic activities. Results: K129E retains its anti-apoptotic activity, but causes errors in mitosis and cytokinesis, which may be linked to its reduced affinity for borealin. Conclusion: K129E expression can induce genomic instability by introducing mitotic aberrations, and thus it may play a causative role in cancer

    Microarrays for global expression constructed with a low redundancy set of 27,500 sequenced cDNAs representing an array of developmental stages and physiological conditions of the soybean plant

    BACKGROUND: Microarrays are an important tool with which to examine coordinated gene expression. Soybean (Glycine max) is one of the most economically valuable crop species in the world food supply. In order to accelerate both gene discovery as well as hypothesis-driven research in soybean, global expression resources needed to be developed. The applications of microarray for determining patterns of expression in different tissues or during conditional treatments by dual labeling of the mRNAs are unlimited. In addition, discovery of the molecular basis of traits through examination of naturally occurring variation in hundreds of mutant lines could be enhanced by the construction and use of soybean cDNA microarrays. RESULTS: We report the construction and analysis of a low redundancy 'unigene' set of 27,513 clones that represent a variety of soybean cDNA libraries made from a wide array of source tissue and organ systems, developmental stages, and stress or pathogen-challenged plants. The set was assembled from the 5' sequence data of the cDNA clones using cluster analysis programs. The selected clones were then physically reracked and sequenced at the 3' end. In order to increase gene discovery from immature cotyledon libraries that contain abundant mRNAs representing storage protein gene families, we utilized a high density filter normalization approach to preferentially select more weakly expressed cDNAs. All 27,513 cDNA inserts were amplified by polymerase chain reaction. The amplified products, along with some repetitively spotted control or 'choice' clones, were used to produce three 9,728-element microarrays that have been used to examine tissue specific gene expression and global expression in mutant isolines. CONCLUSIONS: Global expression studies will be greatly aided by the availability of the sequence-validated and low redundancy cDNA sets described in this report. 
These cDNAs and ESTs represent a wide array of developmental stages and physiological conditions of the soybean plant. We also demonstrate that the quality of the data from the soybean cDNA microarrays is sufficiently reliable to examine isogenic lines that differ with respect to a mutant phenotype and thereby to define a small list of candidate genes potentially encoding or modulated by the mutant phenotype

    Statistical analysis and significance testing of serial analysis of gene expression data using a Poisson mixture model

    Background: Serial analysis of gene expression (SAGE) is used to obtain quantitative snapshots of the transcriptome. These profiles are count-based and are assumed to follow a Binomial or Poisson distribution. However, tag counts observed across multiple libraries (for example, one or more groups of biological replicates) have additional variance that cannot be accommodated by this assumption alone. Several models have been proposed to account for this effect, all of which utilize a continuous prior distribution to explain the excess variance. Here, a Poisson mixture model, which assumes excess variability arises from sampling a mixture of distinct components, is proposed and the merits of this model are discussed and evaluated. Results: The goodness of fit of the Poisson mixture model on 15 sets of biological SAGE replicates is compared to the previously proposed hierarchical gamma-Poisson (negative binomial) model, and a substantial improvement is seen. Two observations further support the mixture model: 1) an increase in the number of mixture components needed to fit the expression of tags representing more than one transcript; and 2) a tendency for components to cluster libraries into the same groups. A confidence score is presented that can identify tags that are differentially expressed between groups of SAGE libraries. Several examples where this test outperforms those previously proposed are highlighted. Conclusion: The Poisson mixture model performs well as a) a method to represent SAGE data from biological replicates, and b) a basis to assign significance when testing for differential expression between multiple groups of replicates. Code for the R statistical software package is included to assist investigators in applying this model to their own data.
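
A short sketch shows why a mixture of Poisson components can absorb variance that a single Poisson cannot. It is not the R code mentioned in the abstract; the weights and rates are arbitrary illustrative values, differences in library size are ignored, and the function names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def poisson_pmf(k: int, rate: float) -> float:
    """Probability of observing count k under a Poisson with the given rate."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


def mixture_pmf(k: int, weights: Sequence[float], rates: Sequence[float]) -> float:
    """Probability of count k under a finite Poisson mixture.

    `weights` are the mixing proportions (assumed to sum to 1) and
    `rates` are the Poisson rates of the components.
    """
    return sum(w * poisson_pmf(k, rate) for w, rate in zip(weights, rates))


def mixture_mean_variance(weights: Sequence[float], rates: Sequence[float]) -> Tuple[float, float]:
    """Mean and variance of a finite Poisson mixture.

    The variance is the mean plus the weighted spread of the component
    rates around it, so it exceeds the mean whenever the rates differ.
    """
    mean = sum(w * rate for w, rate in zip(weights, rates))
    spread = sum(w * (rate - mean) ** 2 for w, rate in zip(weights, rates))
    return mean, mean + spread


if __name__ == "__main__":
    # Illustrative two-component mixture versus a single Poisson.
    print(mixture_mean_variance([0.5, 0.5], [2.0, 10.0]))
    print(mixture_mean_variance([1.0], [6.0]))
```

With a single component the variance equals the mean, which is the Poisson assumption; with distinct components the extra term supplies the additional variance seen across replicate libraries.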

    Combination of p53AIP1 and survivin expression is a powerful prognostic marker in non-small cell lung cancer

    Background: p53AIP1 is a potential mediator of apoptosis depending on p53, which is mutated in many kinds of carcinoma. High survivin expression in non-small cell lung cancer is associated with poor prognosis. To investigate the role of these genes in non-small cell lung cancer, we compared the relationship between p53AIP1 or survivin gene expression and the clinicopathological status of lung cancer. Materials and methods: Forty-seven samples from non-small cell lung cancer patients were obtained between 1997 and 2003. For quantitative evaluation of RNA expression by PCR, we used TaqMan PCR methods. Results: Although no correlation between p53AIP1 or survivin gene expression and clinicopathological factors was found, the relationship between survivin gene expression and nodal status was significant (p = 0.03). Overall survival in the p53AIP1-negative group was significantly worse than in the positive group (p = 0.04); however, although survivin expression was not a prognostic factor, the combination of p53AIP1 and survivin was a significant prognostic predictor (p = 0.04). In the multivariate Cox proportional hazard model, the combination was an independent predictor of overall survival (p53AIP1 (+) survivin (+), HR 0.21, 95%CI = [0.01–1.66]; p53AIP1 (+) survivin (-), HR 0.01, 95%CI = [0.002–0.28]; p53AIP1 (-) survivin (-), HR 0.01, 95%CI = [0.002–3.1], against p53AIP1 (-) survivin (+), p = 0.03). Conclusion: These data suggest that the combination of p53AIP1 and survivin gene expression may be a powerful tool to stratify subgroups with better or worse prognosis from the variable non-small cell lung cancer population.